Task 1: How much should insurance cost? (25 points)

library(tidyverse)
#install.packages("modelsummary")
library(modelsummary)
library(ggplot2)

# Load insurance data
insurance <- read.csv("https://raw.githubusercontent.com/maibennett/sta235/main/exampleSite/content/Assignments/Homework/data/hw1/insurance.csv")

We have a dataset of over 1,300 people with their insurance charges and socio-demographic characteristics.


Exploratory analysis

What is the relationship between charges and age? This plot shows some initial trends:
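The code for this plot is not echoed in the document; a minimal sketch that should reproduce it, assuming the `insurance` data frame loaded above and following the styling of the later plots, is:

```r
# Scatterplot of insurance charges against age
# (sketch; styling guessed from the other plots in this document)
ggplot(data = insurance,
       aes(x = age, y = charges)) +
  geom_point(color = "dark orange", fill = alpha("dark orange", 0.5), size = 3, pch = 21) +
  theme_bw() +
  xlab("Age") + ylab("Charges")
```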

Question 1.1:

Describe the relationship between age and charges based on the previous plot.

Answer: Looking at the previous plot, there is a positive relationship between age and charges. However, there are important differences in charges between individuals of the same age, indicating a high degree of heterogeneity that is not accounted for.

Question 1.2:

Make a new plot that colors these points by whether a person is a smoker or not. What can you tell about the relationship between age and charges for these groups?

Answer: Both groups (smokers and non-smokers) seem to show a similar positive correlation between charges and age. However, it is clear that non-smokers face overall lower insurance charges. The dispersion for non-smokers at a given age also seems to be much smaller than the dispersion associated with smokers at that same age.


ggplot(data = insurance, 
       aes(x = age, y = charges, color = smoker)) +
  geom_point(pch = 21, size = 3) +
  theme_bw()

Question 1.3:

Now, show a new plot with the relationship between BMI and charges. What can you say about that relationship? Does the relationship look linear?

Answer: There is a positive relationship between BMI and charges, but it does not look very linear (the relationship looks closer to quadratic).


ggplot(data = insurance, 
       aes(x = bmi, y = charges)) +
  geom_point(color = "dark orange", fill = alpha("dark orange", 0.5), size = 3, pch = 21) +
  theme_bw()
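One way to probe the claim that the relationship looks more quadratic than linear is to overlay both fits on the same scatterplot (a sketch, reusing the data and styling above):

```r
# Compare a linear (dashed) and a quadratic (solid) fit to the BMI-charges cloud
ggplot(data = insurance, aes(x = bmi, y = charges)) +
  geom_point(color = "dark orange", fill = alpha("dark orange", 0.5), size = 3, pch = 21) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE,
              linetype = "dashed", color = "#900DA4") +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE,
              color = "#900DA4") +
  theme_bw() +
  xlab("BMI") + ylab("Charges")
```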


Let’s build some models

Question 1.4:

Fit a model that regresses charges on BMI, smoking, and their interaction. Interpret the coefficients. Do these results align with what you expected? Why or why not?

Answer:

  • In this model, the intercept is the average charge for an individual who doesn’t smoke and has a BMI of 0, estimated at $5,880. Given that it’s not possible to have a BMI of zero, the intercept doesn’t have an applicable interpretation.

  • The smoker coefficient captures the difference between a person who smokes and a person who doesn’t, both with a BMI of zero. In other words, it is the difference between the intercepts for smokers and non-smokers. In this model, a smoker with a BMI of 0 has average charges $19,066 lower than a non-smoker with a BMI of 0. Like the intercept, it does not have an applicable interpretation.

  • The BMI coefficient captures the change in charges given a one-point increase in BMI, for a non-smoker. In this case, for a one-point increase in BMI, a non-smoker’s estimated insurance charges increase by $83.

  • Finally, the interaction term captures the difference between the change in charges for smokers vs. non-smokers. In this model, for a one-point increase in BMI, smokers face, on average, an increase of $1,390 more than non-smokers.

Overall, these coefficients have the signs we would expect (charges increase with BMI, and more steeply for smokers). The only coefficient that might catch people’s attention is the negative sign on “smoker”, but we have to remember that it does not have an applicable interpretation on its own: it refers to individuals with a BMI of 0, and acts more as a level adjustment to average charges for smokers.


# smoker*bmi expands to smoker + bmi + smoker:bmi, so the main effects
# do not need to be listed separately
lm1 <- lm(charges ~ smoker*bmi, data = insurance)

modelsummary(lm1, stars = TRUE, gof_omit = 'DF|AIC|BIC|Log.Lik.|F')
Model 1
(Intercept) 5879.424***
(976.869)
smokeryes -19066.000***
(2092.028)
bmi 83.351**
(31.269)
smokeryes × bmi 1389.756***
(66.783)
Num.Obs. 1338
R2 0.742
R2 Adj. 0.741
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
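One way to make the interaction concrete is to compute the implied BMI slope for each group directly from the fitted model (assuming `lm1` from the chunk above):

```r
# BMI slope for non-smokers: just the bmi coefficient
slope_nonsmoker <- coef(lm1)["bmi"]

# BMI slope for smokers: bmi coefficient plus the interaction term
slope_smoker <- coef(lm1)["bmi"] + coef(lm1)["smokeryes:bmi"]

c(nonsmoker = unname(slope_nonsmoker), smoker = unname(slope_smoker))
# From the table above, these should come out near 83 and 1473, respectively
```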

Question 1.5:

Using the previous information, build a regression using age, body mass index, smoker category, and sex as covariates, and charges as the dependent variable. Include an interaction between age and BMI. What is the interpretation for the partial relationship between BMI and insurance charges for someone who is 25 years of age versus someone who is 70, holding other variables constant?

Answer: For someone who is 25 years old, a one-point increase in BMI is associated with an expected increase in charges of \(310 + 25\times 0.329 \approx \$318\), holding other variables constant. For a person who is 70 years old, the corresponding expected increase is \(310 + 70\times 0.329 \approx \$333\). However, because the interaction term is not statistically significant, we cannot reject that the change in charges associated with a one-point increase in BMI is the same at all ages.


# age*bmi already includes both main effects, so bmi does not need to be repeated
lm2 <- lm(charges ~ age*bmi + smoker + sex, data = insurance)

modelsummary(lm2, stars = TRUE, gof_omit = 'DF|AIC|BIC|Log.Lik.|F')
Model 1
(Intercept) -11245.497***
(2487.017)
age 249.290***
(61.404)
bmi 310.403***
(79.853)
smokeryes 23835.607***
(414.465)
sexmale -110.109
(334.847)
age × bmi 0.329
(1.949)
Num.Obs. 1338
R2 0.748
R2 Adj. 0.747
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
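The age-specific BMI slopes quoted in the answer can be computed from `lm2` rather than by hand:

```r
# Partial BMI slope at a given age: bmi coefficient plus age times the interaction
bmi_slope <- function(age) coef(lm2)["bmi"] + age * coef(lm2)["age:bmi"]

c(age_25 = unname(bmi_slope(25)), age_70 = unname(bmi_slope(70)))
# Using the table above, roughly 318 and 333
```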

Task 2: Food access and mortality (25 points)

food_health <- read.csv("https://raw.githubusercontent.com/maibennett/sta235/main/exampleSite/content/Assignments/Homework/data/hw1/food_health_politics.csv") 

We’re interested in looking at the relationships between food access, mortality, and politics. To do this, we look at data from three different sources:

Each row in the dataset is a US county. The main outcome we care about is mortality_rate, the number of deaths per 100,000 people in a county between 2013 and 2015. Other interesting variables in the dataset include:

Exploratory analysis

Question 2.1:

Let’s look at the data. Plot a histogram of the outcome variable and describe your findings. Do the same but with the covariate that captures the percentage of the population with low access to food. Note: Remember to label your axis appropriately.

Answer: Looking at the histogram below, our outcome appears to follow a normal distribution centered around 850 deaths per 100,000 residents. We also lose 8 observations, which indicates that the outcome has 8 missing values.


ggplot(data = food_health, aes(x = mortality_rate)) + 
  geom_histogram(color = "dark orange", fill = alpha("dark orange", 0.5), lwd = 1.2) +
    theme_bw() +
  xlab("Mortality rate") + ylab("Count")

In the case of the food access variable, the distribution is right-skewed, with a long right tail. We also see around 100 counties where 100% of the population has low access to food. Additionally, this plot drops 19 observations because they have missing values.


ggplot(data = food_health, aes(x = pct_low_access_pop)) + 
  geom_histogram(color = "dark orange", fill = alpha("dark orange", 0.5), lwd = 1.2) +
    theme_bw() +
  xlab("% Pop. with Low Access to Food") + ylab("Count")
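The missing-value counts mentioned above can be checked directly rather than read off the plot warnings (a sketch, assuming the raw `food_health` data loaded above):

```r
# Count the NAs in the two variables plotted in the histograms above
colSums(is.na(food_health[, c("mortality_rate", "pct_low_access_pop")]))
```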

Question 2.2:

Create a new data set that removes missing (NA) values from your dependent variable and from pct_low_access_pop. Name the new dataset food_health_clean. (Hint: You might find the function is.na() useful! is.na(x) will return TRUE if x is missing and FALSE otherwise. You can also use !is.na(x), which will do the exact opposite of is.na(x).)


food_health_clean <- food_health %>% filter(!is.na(mortality_rate) & !is.na(pct_low_access_pop))
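An equivalent, slightly more compact way to build the same dataset uses `tidyr::drop_na()` (tidyr is loaded with the tidyverse), which drops rows with an NA in any of the listed columns:

```r
# Same result as the filter() call above
food_health_clean <- food_health %>%
  drop_na(mortality_rate, pct_low_access_pop)
```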

Question 2.3:

How related are mortality rate and access to food? Show an appropriate plot and estimate the correlation between both variables (Hint: Include a regression line)

Answer: From the plot below, there is a negative correlation (-.16) between the percentage of the population with low access to food and the mortality rate. However, the correlation is fairly weak. (Reporting the estimated slope of the fitted line is also considered correct.)


ggplot(data = food_health_clean, aes(x = pct_low_access_pop, y = mortality_rate)) + 
  geom_point(color = "dark orange", fill = alpha("dark orange", 0.5), size = 3, pch = 21) +
  geom_smooth(method = "lm", color = "#900DA4", lwd = 1.5, se = FALSE) +
    theme_bw() +
  xlab("% of Pop. with Low Access to Food") + ylab("Mortality rate")


food_health_clean %>% select(pct_low_access_pop, mortality_rate) %>% cor(.)
##                    pct_low_access_pop mortality_rate
## pct_low_access_pop          1.0000000     -0.1631545
## mortality_rate             -0.1631545      1.0000000

Question 2.4:

Now we want to look at the relationship between mortality rate and other variables. How related are mortality rate and the prevalence of fast food restaurants per 1,000 individuals? And mortality and the prevalence of SNAP stores per 1,000 residents? And mortality rate and the percent of the county that voted for Democrats in 2012? Show the appropriate plots and explain your conclusions.

Answer:


library(patchwork)

g1 <- ggplot(data = food_health, aes(x = fastfood_per_1000, y = mortality_rate)) + 
  geom_point(color = "dark orange", fill = alpha("dark orange", 0.5), size = 3, pch = 21) +
  geom_smooth(method = "lm", color = "#900DA4", lwd = 1.5, se = FALSE) +
    theme_bw() +
  xlab("Fast food restaurants per 1,000 residents") + ylab("Mortality rate") + ggtitle("Fast Food")

g2 <- ggplot(data = food_health, aes(x = snap_stores_per_1000, y = mortality_rate)) + 
  geom_point(color = "dark orange", fill = alpha("dark orange", 0.5), size = 3, pch = 21) +
  geom_smooth(method = "lm", color = "#900DA4", lwd = 1.5, se = FALSE) +
    theme_bw() +
  xlab("SNAP stores per 1,000 residents") + ylab("Mortality rate") + ggtitle("SNAP Stores")

g3 <- ggplot(data = food_health, aes(x = per_dem_2012, y = mortality_rate)) + 
  geom_point(color = "dark orange", fill = alpha("dark orange", 0.5), size = 3, pch = 21) +
  geom_smooth(method = "lm", color = "#900DA4", lwd = 1.5, se = FALSE) +
    theme_bw() +
  xlab("% votes for Democrats") + ylab("Mortality rate") + ggtitle("Democratic Vote")

g1 + g2 + g3
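The visual impressions from these panels can be backed with correlations. Since these variables still contain NAs in `food_health`, `use = "pairwise.complete.obs"` computes each pairwise correlation on all rows where both variables are observed:

```r
# Correlations between mortality and each of the three covariates plotted above
food_health %>%
  select(mortality_rate, fastfood_per_1000, snap_stores_per_1000, per_dem_2012) %>%
  cor(use = "pairwise.complete.obs")
```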

Let’s run some models

Question 2.5:

Is access to food associated with mortality? Run a simple linear model on your clean dataset and explain the coefficients. Does the main coefficient have the sign you would expect? (Hint: Be aware of the units of each of the variables)

Answer: There is a negative association between food access and mortality rate, significant at conventional levels. A 1 percentage-point increase in the share of people with low access to food (a 0.01 increase in this variable, which is measured as a proportion) is associated with a decrease of 1.24 in the mortality rate (1.24 fewer deaths per 100,000 residents). This result is counterintuitive: one would expect that if a larger share of the population has low access to food, the mortality rate would increase.


lm3 <- lm(mortality_rate ~ pct_low_access_pop, data = food_health_clean)

modelsummary(lm3, stars = TRUE, gof_omit = 'DF|AIC|BIC|Log.Lik.|F')
Model 1
(Intercept) 845.314***
(4.056)
pct_low_access_pop -124.516***
(13.493)
Num.Obs. 3116
R2 0.027
R2 Adj. 0.026
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
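Because pct_low_access_pop is stored as a proportion (0 to 1), the per-percentage-point figure quoted in the answer is just the coefficient rescaled by 0.01 (assuming `lm3` from the chunk above):

```r
# Change in mortality rate per 1 percentage-point (0.01) increase in low access
coef(lm3)["pct_low_access_pop"] * 0.01
# About -1.24 deaths per 100,000 residents, per the table above
```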

Question 2.6:

Create a new data set from food_health_clean, but remove the outliers from the independent variable. Do your results change compared to the previous question?

Answer: We remove counties where 100% of the population has low access to food. The result is very similar to the previous model on the full cleaned dataset, which means this association is not driven by those outlier counties.


food_health_clean_no_outliers <- food_health_clean %>% filter(pct_low_access_pop<1)

lm4 <- lm(mortality_rate ~ pct_low_access_pop, data = food_health_clean_no_outliers)

modelsummary(lm4, stars = TRUE, gof_omit = 'DF|AIC|BIC|Log.Lik.|F')
Model 1
(Intercept) 845.409***
(4.172)
pct_low_access_pop -125.043***
(14.615)
Num.Obs. 3086
R2 0.023
R2 Adj. 0.023
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Question 2.7:

Finally, run a model of mortality rate on the proportion of the population with low access to food, the number of SNAP stores per 1,000 residents, and the percentage of the Democratic vote in 2012. Interpret the coefficients for SNAP stores and Democratic vote.

Answer: Holding the other covariates constant, one additional SNAP store per 1,000 residents is associated with an increase of about 201 deaths per 100,000 residents. For Democratic vote, going from 0% to 100% of the county voting for Democrats in 2012 is associated with about 144 fewer deaths per 100,000 residents; equivalently, a 1 percentage-point increase in Democratic vote share is associated with about 1.44 fewer deaths, holding the other covariates constant.


lm5 <- lm(mortality_rate ~ pct_low_access_pop + snap_stores_per_1000 + per_dem_2012, data = food_health_clean)

modelsummary(lm5, stars = TRUE, gof_omit = 'DF|AIC|BIC|Log.Lik.|F')
Model 1
(Intercept) 722.079***
(9.217)
pct_low_access_pop -161.011***
(12.346)
snap_stores_per_1000 201.052***
(6.489)
per_dem_2012 -144.093***
(15.403)
Num.Obs. 3068
R2 0.273
R2 Adj. 0.272
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Task 3: Who is better at shooting hoops? (25 points)

nba <- read.csv("https://raw.githubusercontent.com/maibennett/sta235/main/exampleSite/content/Assignments/Homework/data/hw1/nba.csv") 

You are the brand new manager for an NBA team, and are trying to moneyball1 this, looking at data to get some idea of the characteristics that a successful player might have.

You have historic data for players’ statistics as rookies and whether they are still in the NBA 5 years later. The dataset has the following variables:

You want to build a long-term dream team, so the main outcome you are interested in is whether a player is still in the league after 5 years.

Note: If we have a regression \(Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\), where the outcome \(Y\) is a binary variable, we can interpret \(\hat{\beta}_1\) as a change in probability of \(Y=1\) happening. In this case, a one-unit increase in \(X\) would be associated with a \(\hat{\beta}_1\) increase in the probability of \(Y=1\). Also, remember that for a binary variable \(\mathbb{E}(Y) = Pr(Y=1)\).

Let’s fit some models

Question 3.1

First, create a new variable that captures the total number of hours played as a rookie. Name that new variable TMIN.

Answer: We can create the new variable by multiplying the average number of minutes played per game by the number of games played as a rookie. We also divide this number by 60 to convert the units to hours.


nba <- nba %>% mutate(TMIN = GP*MIN/60)

Question 3.2

Fit a model that includes the total number of hours played during rookie year and field goal percentage. Interpret the coefficients for this regression.

Answer: In this case, the intercept captures the average probability that a player is still in the NBA 5 years later if, during their rookie year, they did not play at all and had a field goal percentage of 0. The coefficient for TMIN tells us that one additional hour of game time is associated with a .013 increase in the probability of staying in the NBA, holding field goal percentage constant. Finally, a 1-point increase in FGP (note that FGP is measured on a 0-100 scale) is associated with an estimated average increase of .011 in the probability of staying in the league, holding time played constant.


lm6 <- lm(TARGET_5Yrs ~ TMIN + FGP, data = nba)

modelsummary(lm6, stars = TRUE, gof_omit = 'DF|AIC|BIC|Log.Lik.|F')
Model 1
(Intercept) -0.137
(0.088)
TMIN 0.013***
(0.001)
FGP 0.011***
(0.002)
Num.Obs. 1340
R2 0.163
R2 Adj. 0.162
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Question 3.3

Now, fit a model that includes the total number of hours played during the rookie year and field goals made. How do these results compare to your previous model? Explain why you think that is.

Answer: We can see that FGM has a negative coefficient, but it is not statistically significant at conventional levels (we cannot reject the null hypothesis that the coefficient is equal to 0). Even though FGM and FGP are related, the main difference is that FGM is much more correlated with time played, so a reason we don’t see a significant association (and see a switch in sign) could be multicollinearity between the two variables.


lm7 <- lm(TARGET_5Yrs ~ TMIN + FGM, data = nba)

modelsummary(lm7, stars = TRUE, gof_omit = 'DF|AIC|BIC|Log.Lik.|F')
Model 1
(Intercept) 0.342***
(0.023)
TMIN 0.017***
(0.002)
FGM -0.016
(0.015)
Num.Obs. 1340
R2 0.145
R2 Adj. 0.144
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
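The multicollinearity story can be checked by comparing how strongly each shooting variable correlates with playing time (a sketch, assuming the `nba` data with the `TMIN` column created above):

```r
# FGM should track playing time much more closely than FGP does,
# since more minutes mechanically means more made field goals
c(cor_TMIN_FGM = cor(nba$TMIN, nba$FGM),
  cor_TMIN_FGP = cor(nba$TMIN, nba$FGP))
```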

Let’s standardize some variables

Question 3.4

You want to run a regression now that will consider total time played as a rookie, percentage of field goals, and percentage of free throws. Without transforming any variables, can you say which attribute has a larger association with the probability of still being in the NBA 5 years later? Why or why not?

Answer: We can’t compare the magnitude of the coefficients directly, because the covariates are in different units.

Question 3.5

Standardize your covariates to fit the previous model. What can you say about these results? Which is the variable that you would pay most attention to when looking to hire a player given this information? Interpret the coefficient for TMIN.

Answer: We can see from these results that time played is the variable most strongly associated with the probability of still being in the league 5 years later. For a one-standard-deviation increase in the number of hours played during the rookie year, the estimated probability of staying in the league increases by .16 on average, holding the other variables constant.


scale2 <- function(x, na.rm = FALSE){
  return((x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm))
}

nba_std <- nba %>% mutate_at(vars(TMIN, FGP, FTP),
                             ~scale2(.,na.rm=TRUE))

lm8 <- lm(TARGET_5Yrs ~ TMIN + FGP + FTP, data = nba)
lm9 <- lm(TARGET_5Yrs ~ TMIN + FGP + FTP, data = nba_std)

modelsummary(list("Non-Std Model" = lm8, "Std Model" = lm9), stars = TRUE, gof_omit = 'DF|AIC|BIC|Log.Lik.|F')
Non-Std Model Std Model
(Intercept) -0.348** 0.620***
(0.134) (0.012)
TMIN 0.013*** 0.160***
(0.001) (0.013)
FGP 0.012*** 0.076***
(0.002) (0.013)
FTP 0.003* 0.027*
(0.001) (0.013)
Num.Obs. 1340 1340
R2 0.166 0.166
R2 Adj. 0.164 0.164
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
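A useful sanity check on the two columns: standardizing only the covariates multiplies each slope by that covariate's standard deviation, so the models should be linked by (assuming `nba`, `lm8`, and `lm9` from above):

```r
# The non-standardized TMIN slope times sd(TMIN) should match the
# standardized TMIN slope
coef(lm8)["TMIN"] * sd(nba$TMIN, na.rm = TRUE)
coef(lm9)["TMIN"]
# Both should be close to the 0.160 reported above
```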

Task 4: What’s up with Twitter? (25 points)

tweets <- read.csv("https://raw.githubusercontent.com/maibennett/sta235/main/exampleSite/content/Assignments/Homework/data/hw1/USElectionTweets2016.csv", 
                   stringsAsFactors = FALSE)

Social media is a powerful tool to analyze how individuals behave on a large scale and how they react to different events. Particularly in elections, data from Twitter or Facebook can help analyze voters in a more dynamic way. The following dataset is a collection of tweets scraped from Twitter related to the 2016 Presidential election (it captures four specific dates).

Some of the variables of interest are the following:

We will use this data to analyze what people were saying on Twitter about the 2016 election.

Question 4.1

Given that we are particularly interested in the two candidates that were on the ballot, create a new dataset that only contains tweets related to Hillary Clinton or Donald Trump. Name this new dataset tweets_two. How many observations does your new dataset have?

Answer: The new dataset has 16,325 tweets.


tweets_two <- tweets %>% filter(candidate %in% c("Hillary Clinton", "Donald Trump"))
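The count reported above comes straight from the filtered data:

```r
# Number of tweets mentioning either of the two major-party candidates
nrow(tweets_two)
# 16325, matching the Num.Obs. rows in the model tables below
```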

Question 4.2

Look at the distribution of polarity and subjectivity for each candidate. What can you say about these distributions? Are there differences in how these variables behave depending on the candidate?

Answer: From the histograms, it is clear that for both candidates most tweets are classified as neutral. The distribution of polarity looks very similar for the two candidates, but overall Donald Trump has substantially more tweets (both positive and negative). In terms of subjectivity, tweets about the two candidates again behave similarly, but tweets about Donald Trump do appear to be more subjective than those about Hillary Clinton. The majority of tweets are classified as factual for both candidates.


ggplot(data = tweets_two, aes(x = polarity, color = factor(candidate), fill = factor(candidate))) +
  geom_histogram(lwd = 1.1) +
  scale_color_manual("Candidates",values=c("#E16462","#F89441")) +
  scale_fill_manual("Candidates",values=c(alpha("#E16462",0.3),alpha("#F89441",0.3))) +
  theme_bw() + ggtitle("Polarity") + xlab("Polarity") + ylab("Count")


ggplot(data = tweets_two, aes(x = subjectivity, color = factor(candidate), fill = factor(candidate))) +
  geom_histogram(lwd = 1.1) +
  scale_color_manual("Candidates",values=c("#E16462","#F89441")) +
  scale_fill_manual("Candidates",values=c(alpha("#E16462",0.3),alpha("#F89441",0.3))) +
  theme_bw() + ggtitle("Subjectivity") + xlab("Subjectivity") + ylab("Count")
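The histogram comparison can be summarized numerically with per-candidate averages (a sketch using dplyr, which the tidyverse loads):

```r
# Average polarity and subjectivity by candidate, with tweet counts
tweets_two %>%
  group_by(candidate) %>%
  summarize(mean_polarity = mean(polarity, na.rm = TRUE),
            mean_subjectivity = mean(subjectivity, na.rm = TRUE),
            n = n())
```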

Question 4.3

We are going to create a dummy variable for Donald Trump and analyze the sentiment of the tweets. First, we just want to see whether there is a difference between the two candidates for polarity and subjectivity, so you fit the following two models:

tweets_two <- tweets_two %>% mutate(trump = ifelse(candidate == "Donald Trump", 1, 0))

lm_pol <- lm(polarity ~ trump + state_dem, data = tweets_two)
lm_subj <- lm(subjectivity ~ trump + state_dem, data = tweets_two)

modelsummary(list("Polarity" = lm_pol, "Subjectivity" = lm_subj), stars = TRUE, gof_omit = 'DF|AIC|BIC|Log.Lik.|F')
Polarity Subjectivity
(Intercept) 0.030*** 0.289***
(0.006) (0.006)
trump 0.007 0.029***
(0.006) (0.007)
state_dem -0.008+ 0.000
(0.005) (0.005)
Num.Obs. 16325 16325
R2 0.000 0.001
R2 Adj. 0.000 0.001
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Explain what the coefficient for trump means in this case.

Answer: For these models, the coefficient for trump captures the average difference in polarity (and subjectivity) between tweets that mention Donald Trump and those that mention Hillary Clinton. In terms of polarity, there is no significant difference between tweets related to Trump and tweets related to Clinton. On the other hand, tweets related to Trump are .029 more subjective (on a scale of 0-1) than tweets related to Clinton.

Question 4.4

You have a suspicion that polarity and subjectivity behave differently for both candidates depending on whether the tweet comes from a state that is considered democrat or republican. How would you test this? Fit the appropriate model and interpret the coefficients. What conclusions can you make based on this analysis?

Answer: In these models, the intercept captures the average polarity (or subjectivity) of a tweet mentioning Clinton in a Republican state. The trump coefficient captures the average difference in the outcome for tweets mentioning Donald Trump in a Republican state. In both models that coefficient is statistically significant, showing that, on average, in Republican states, tweets mentioning Donald Trump are more positive and more subjective than those mentioning Hillary Clinton.

The interaction term captures how the Trump vs. Clinton difference in the outcome changes in Democratic states relative to Republican states. In the polarity model, the Trump vs. Clinton gap is .032 units lower in Democratic states than in Republican states. For subjectivity, that difference is not statistically significant, so we cannot reject that the coefficient is equal to 0.


lm_pol2 <- lm(polarity ~ trump*state_dem, data = tweets_two)
lm_subj2 <- lm(subjectivity ~ trump*state_dem, data = tweets_two)

modelsummary(list("Polarity" = lm_pol2, "Subjectivity" = lm_subj2), stars = TRUE, gof_omit = 'DF|AIC|BIC|Log.Lik.|F')
Polarity Subjectivity
(Intercept) 0.017* 0.287***
(0.008) (0.008)
trump 0.024** 0.032***
(0.009) (0.010)
state_dem 0.016 0.005
(0.011) (0.012)
trump × state_dem -0.032** -0.006
(0.012) (0.013)
Num.Obs. 16325 16325
R2 0.001 0.001
R2 Adj. 0.000 0.001
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
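From the interaction model, the Trump vs. Clinton polarity gap in each type of state can be read off the coefficients (assuming `lm_pol2` from the chunk above):

```r
# Trump - Clinton difference in polarity, by state type
gap_rep <- coef(lm_pol2)["trump"]
gap_dem <- coef(lm_pol2)["trump"] + coef(lm_pol2)["trump:state_dem"]

c(republican = unname(gap_rep), democratic = unname(gap_dem))
# From the table above, roughly 0.024 and -0.008
```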

  1. If you don’t know the term, you can read the book’s premise here.

  2. Both polarity and subjectivity are measures estimated using Sentiment Analysis, meaning an algorithm detects characteristics of the tweet based on its text.